A supplement to the manuscript:
An Epidemics Model for Non-First-Order Transmission Kinetics
Eun-Young Mun$^{*1}$ and Feng Geng$^{2}$
$^{1}$ Department of Health Behavior and Health Systems, School of Public Health, University of North Texas Health Science Center, Fort Worth, TX, USA
$^{2}$ School of Professional Studies, Northwestern University, Chicago, IL, USA
Background
Data analytics is a powerful tool for extracting knowledge from data. Few bodies of knowledge are more vital to human welfare than that concerning a pandemic-causing virus. How does such a virus spread? What makes it spread quickly or slowly? Which measures are most effective at suppressing it? The answers to these questions are prerequisites to any knowledge-based action in combating a pandemic.
As of September 2020, the Covid-19 pandemic has been running for six months and shows no sign of abating in the US. The devastating impact of pandemics illustrates the importance of developing the capability to quickly gain knowledge of any emerging infectious agent. Meanwhile, the real data collected on this particular coronavirus is a valuable resource for developing and practicing such capability, and for building knowledge to guide policy making in combating the ongoing pandemic.
Aim
This project aims to analyze real Covid-19 data from multiple countries and to build models that correlate the disease growth rate with demographic characteristics and the disease-control measures taken by those countries. From these findings, recommendations for policies or action plans will be made.
Original Data
The Covid-19 data used in this project is made available by Our World in Data (ourworldindata.org) at https://github.com/owid/covid-19-data/tree/master/public/data.
The data downloaded for this project spans January 1st to June 30th, 2020. The CSV file follows a format of one row per location and date. The locations are in most cases countries, along with some independently administered regions such as Taiwan. The variables cover all of the main data on confirmed cases, deaths, and testing, as well as other variables of potential interest.
Since June 30th, new parameters have been added to the dataset. For the convenience of readers, the dataset downloaded on June 30th is attached as 'owid-covid-data.csv'. Note that to run the code posted here, one needs to change the path accordingly.
import os
import numpy as np
import pandas as pd
from collections import Counter
import math
import statistics as stat
import missingno as msno
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer as Impute
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
from statsmodels.formula.api import glm
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
from datetime import datetime
date_str = datetime.now().strftime('%Y-%m-%d')
pd.set_option('display.max_column', None)
pd.set_option('display.max_row', None)
pd.set_option('display.max_seq_items', None)
path = "C:\\Users\\80136730\\DataScience\\Covid19\\Original data"
file = 'owid-covid-data.csv'
data = pd.read_csv(os.path.join(path, file))
# quick survey of features
data.columns
# shorten some feature names
data.columns = ['loc_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
'total_deaths', 'new_deaths', 'total_cases_pmm',
'new_cases_pmm', 'total_deaths_pmm',
'new_deaths_pmm', 'total_tests', 'new_tests',
'total_tests_pm', 'new_tests_pm',
'new_tests_sm', 'new_tests_pmsm', 'tests_units',
'stringency_index', 'population', 'pop_den', 'median_age',
'older_65', 'older_70', 'gdp_pc', 'extreme_poverty',
'cvd_death_rate', 'diabetes_prevalence', 'female_smokers',
'male_smokers', 'handwashing_facilities', 'hospital_beds_pm',
'life_expectancy']
# country names like 'South Korea' or "Cote d'Ivoire" will cause problems;
# replace spaces, apostrophes, and hyphens with underscores
data.replace({'location': {" ":"_", "'":"_", "-":"_"}}, regex=True, inplace=True)
data.replace({'continent': {" ":"_", "'":"_", "-":"_"}}, regex=True, inplace=True)
# analysis will be using "new cases per million (pmm)". Drop some variables (columns).
df1 = data.drop(['total_cases', 'new_cases','total_deaths', 'new_deaths',
'total_tests', 'new_tests','total_tests_pm', 'new_tests_pm',
'new_tests_sm', 'tests_units'], axis=1)
# remove aggregate data and small country data
df1.drop(df1[df1['location'].isin(['International','World'])].index, inplace=True)
df1.drop(df1[df1['population'] < 1000000].index, inplace=True)
# replace some negative numbers (due to data adjustments?) with zero
columns = ['total_cases_pmm', 'new_cases_pmm', 'total_deaths_pmm', 'new_deaths_pmm']
for col in columns:
    df1.loc[df1[col] < 0, col] = 0
# convert 4 key variables to 7 day rolling average
for column in columns:
    new_col = []
    for row in sorted(set(df1['location'])):
        section = df1.loc[df1['location'] == row, column]
        new_col.append(section.rolling(window=7).mean().to_list())
    df1['{}r'.format(column)] = [item for sublist in new_col for item in sublist]
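The per-location loop above can equivalently be written with `groupby`/`transform`, which keeps the result index-aligned without flattening lists. A minimal sketch on a toy frame (the toy data here is illustrative, not from the dataset):

```python
import pandas as pd

# toy frame: two locations, one daily value each
toy = pd.DataFrame({
    'location': ['A'] * 10 + ['B'] * 10,
    'new_cases_pmm': list(range(10)) * 2,
})

# 7-day rolling mean computed within each location, aligned to the original index
toy['new_cases_pmmr'] = (
    toy.groupby('location')['new_cases_pmm']
       .transform(lambda s: s.rolling(window=7).mean())
)
```

The first six rows of each location are NaN, as with the explicit loop, because a full 7-day window is not yet available.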
# now drop the original unsmoothed columns
df2 = df1.drop(columns = columns)
# there are many days with zero cases at the beginning; drop them
df3 = df2[df2['total_cases_pmmr']>=1].copy()
# make a time variable marking the days of epidemic for each country
days = [1]*df3.shape[0]
for i in range(1, df3.shape[0]):
    if df3.iloc[i]['location'] == df3.iloc[i-1]['location']:
        days[i] = days[i-1] + 1
df3['days'] = days
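Since the frame is already sorted by location and date, the same day counter can be obtained in one line with `groupby().cumcount()`; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B']})
# cumcount is 0-based, so add 1 to start each location at day 1
toy['days'] = toy.groupby('location').cumcount() + 1
```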
# some locations are missing population density data
set(df3[df3['pop_den'].isna()].loc_code)
# from www.worldometers.info, SSD, SYR and TWN pop_den in per square km
df3.loc[df3['loc_code']=='SSD', 'pop_den']=18
df3.loc[df3['loc_code']=='SYR', 'pop_den']=95
df3.loc[df3['loc_code']=='TWN', 'pop_den']=651
df3['newCase_den'] = df3['new_cases_pmmr']*df3['pop_den'] #the unit is cases per million square km
# make a two-week cumulative sum of new cases as an approximation for active cases
active = df3.groupby('location', as_index=False).rolling(14, min_periods=1).newCase_den.sum()
df3['active'] = active.reset_index(level=0, drop=True)
# convert key variables to logarithm
df3['log $t$'] = np.log(df3.days+1)
df3['log_total_case'] = np.log(df3.total_cases_pmmr+1)
df3['log_new_case'] = np.log(df3.new_cases_pmmr+1)
df3['log_total_death']= np.log(df3['total_deaths_pmmr']+1)
df3['log_new_death']= np.log(df3.new_deaths_pmmr+1)
df3['log_pop']= np.log(df3['population'])
df3['log_pop_den']= np.log(df3['pop_den'])
df3['log [$I$]']= np.log(df3['active'])
# drop the old feature columns
df4 = df3.drop(columns=['date', 'log_total_death', 'log_new_death',
'total_deaths_pmmr', 'new_deaths_pmmr', 'active'])
# drop countries whose total_cases_pmmr is 2.4 or less at day 14
drop_list24 = df4[(df4['days']==14) & (df4['total_cases_pmmr']<=2.4)].loc_code.tolist()
df5 = df4[~df4['loc_code'].isin(drop_list24)].copy()
#df5.to_csv('df5 {date}.csv'.format(date=date_str),index=False)
# Next, truncate the data after log_days > 2.5 (~12 days) or log_days > 3.0 (~20 days)
df5_trunc25 = df5[df5['log $t$']<=2.5].copy()
df5_trunc30 = df5[df5['log $t$']<=3.0].copy()
# df5_trunc25.to_csv('df5_trunc25 {date}.csv'.format(date=date_str),index=False)
# df5_trunc30.to_csv('df5_trunc30 {date}.csv'.format(date=date_str),index=False)
# visualize the dataset
fig, axs = plt.subplots(2, 2, figsize=(9,6), dpi=300, sharex=True)
sns.lineplot(ax=axs[0,0], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent']=='Africa'],
hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[0,0], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent']=='Africa'],
hue="loc_code", palette="colorblind", legend=None, s=10)
axs[0,0].legend(title='Africa', loc='upper left', title_fontsize='12',frameon=False)
axs[0,0].set_ylabel("ln[$I$]", fontsize='12')
axs[0,0].yaxis.set_major_locator(MaxNLocator(integer=True))
sns.lineplot(ax=axs[0,1], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent']=='Asia'],
hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[0,1], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent']=='Asia'],
hue="loc_code", palette="colorblind", legend=None, s=10)
axs[0,1].legend(title='Asia', loc='upper left', title_fontsize='12',frameon=False)
axs[0,1].set_ylabel("")
axs[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))
sns.lineplot(ax=axs[1,0], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent'].isin(['Europe','Oceania'])],
hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[1,0], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent'].isin(['Europe','Oceania'])],
hue="loc_code", palette="colorblind", legend=None, s=10)
axs[1,0].legend(title='Europe & Oceania', loc='upper left', title_fontsize='12',frameon=False)
axs[1,0].set_ylabel("ln[$I$]", fontsize='12')
axs[1,0].set_xlabel("ln $t$", fontsize='12')
axs[1,0].yaxis.set_major_locator(MaxNLocator(integer=True))
sns.lineplot(ax=axs[1,1], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent'].isin(['North_America','South_America'])],
hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[1,1], x='log $t$', y='log [$I$]', data=df5.loc[df5['continent'].isin(['North_America','South_America'])],
hue="loc_code", palette="colorblind", legend=None, s=10)
axs[1,1].legend(title='Americas', loc='upper left', title_fontsize='12',frameon=False)
axs[1,1].set_ylabel("")
axs[1,1].set_xlabel("ln $t$", fontsize='12')
axs[1,1].yaxis.set_major_locator(MaxNLocator(integer=True));
sns.set(font='Arial')
sns.set_style('white')
fig, axs = plt.subplots(2, 2, figsize=(9, 6), dpi=300, sharex=True)
sns.lineplot(ax=axs[0,0], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent']=='Africa'],
hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[0,0], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent']=='Africa'],
hue="loc_code", palette="colorblind", legend=None, s=10)
axs[0,0].legend(title='Africa', loc='upper left', title_fontsize='12',frameon=False)
axs[0,0].set_ylabel("ln[$I$]", fontsize='12')
sns.lineplot(ax=axs[0,1], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent']=='Asia'],
hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[0,1], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent']=='Asia'],
hue="loc_code", palette="colorblind", legend=None, s=10)
axs[0,1].legend(title='Asia', loc='upper left', title_fontsize='12',frameon=False)
axs[0,1].set_ylabel("")
axs[0,1].yaxis.set_major_locator(MaxNLocator(integer=True))
sns.lineplot(ax=axs[1,0], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent'].isin(['Europe','Oceania'])],
hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[1,0], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent'].isin(['Europe','Oceania'])],
hue="loc_code", palette="colorblind", legend=None, s=10)
axs[1,0].legend(title='Europe & Oceania', loc='upper left', title_fontsize='12',frameon=False)
axs[1,0].set_ylabel("ln[$I$]", fontsize='12')
axs[1,0].set_xlabel("ln $t$", fontsize='12')
sns.lineplot(ax=axs[1,1], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent'].isin(['North_America','South_America'])],
hue="loc_code", palette="colorblind", legend=None, linewidth=0.5)
sns.scatterplot(ax=axs[1,1], x='log $t$', y='log [$I$]', data=df5_trunc25.loc[df5_trunc25['continent'].isin(['North_America','South_America'])],
hue="loc_code", palette="colorblind", legend=None, s=10)
axs[1,1].legend(title='Americas', loc='upper left', title_fontsize='12',frameon=False)
axs[1,1].set_ylabel("")
axs[1,1].set_xlabel("ln $t$", fontsize='12');
plot = msno.matrix(df5_trunc25)
plt.gcf().subplots_adjust(top=0.35)
fig = plot.get_figure()
#fig.savefig('missing_acti.tif')
# quick look at the number of missing values, by columns
df5_trunc25.isna().sum().sort_values().tail(10)
# Impute missing data in the numeric independent variables
# separate out non-numeric columns before imputation and scaling
# also, set aside variables that either are not numeric or have too many missing values
X = df5_trunc25.drop(['loc_code', 'continent', 'location', 'new_tests_pmsm', 'handwashing_facilities', 'extreme_poverty'], axis=1)
X.isna().sum().sort_values()
# use Impute to fill in each row's missing features
# dummy codes should help the imputation, but it takes a long time!
imp = Impute(max_iter = 100, random_state = 123)
X_imputed = pd.DataFrame(imp.fit_transform(X), columns=X.columns, index=X.index)
# scale the imputed independent parameters
scaler = MinMaxScaler(feature_range=(0, 100))
# don't scale pop_den, days, log t, or the dependent variables
X = X_imputed.drop(['pop_den','log $t$', 'days', 'log [$I$]'], axis=1)
scaler100 = scaler.fit(X)
array_X = scaler100.transform(X).round(2)
X_scaled = pd.DataFrame(array_X, columns=X.columns, index=X.index)
plt.figure(figsize=(15,5))
ax = sns.boxplot(data=X_scaled)
ax.set_xticklabels(ax.get_xticklabels(), rotation=-45, ha='left', fontsize=16)
ax.tick_params(axis='y', labelsize=16);
fig = ax.get_figure();
df6 = pd.concat([df5_trunc25[['loc_code', 'location', 'continent', 'days', 'log $t$', 'log [$I$]', 'new_tests_pmsm']], X_imputed['pop_den'], X_scaled], axis=1)
df6.shape
#df6.to_csv('df6 {date}.csv'.format(date=date_str), index=False)
$$\ln[I] = \frac{\ln[(1-n)\,k]}{1-n} + \frac{1}{1-n}\cdot\ln t$$
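This fitting equation follows from integrating the non-first-order rate law of the manuscript's model, presumably $d[I]/dt = k\,[I]^n$; a sketch of the derivation (assuming $n < 1$ and $[I](0) = 0$):

```latex
\frac{d[I]}{dt} = k\,[I]^n
\;\Rightarrow\; [I]^{-n}\,d[I] = k\,dt
\;\Rightarrow\; \frac{[I]^{1-n}}{1-n} = k\,t
\;\Rightarrow\; [I] = \left[(1-n)\,k\,t\right]^{\tfrac{1}{1-n}}
```

Taking logarithms gives $\ln[I] = \frac{\ln[(1-n)k]}{1-n} + \frac{1}{1-n}\ln t$, so the slope of $\ln[I]$ versus $\ln t$ estimates $1/(1-n)$ and the intercept estimates $\ln[(1-n)k]/(1-n)$, which is how $n$ and $k$ are back-calculated from the regression coefficients below.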
# linear regression log[I]~logt
coef = []
inte = []
r2 = []
tests = []
index_d = []
locCode = []
aic = []
bic = []
for loca in sorted(set(df6.loc_code)):
    subset = df6[df6.loc_code == loca]
    mi = subset['log [$I$]'].idxmax()
    y = subset['log [$I$]']
    X = subset['log $t$']
    X = sm.add_constant(X)
    ols = sm.OLS(y, X).fit()
    aic.append(ols.aic)
    bic.append(ols.bic)
    if len(ols.params.values) == 2:
        coef.append(ols.params.values[1])
        inte.append(ols.params.values[0])
        r2.append(ols.rsquared)
        locCode.append(loca)
        index_d.append(mi)
        tests.append(subset['new_tests_pmsm'].mean())
    else:
        print(loca)
stat.mean(aic), stat.mean(bic)
# collect the results
mydict = { 'loc_code':locCode, 'tests':tests, 'inte_ld':inte, 'coef_ld':coef, 'r2_ld':r2}
collect = pd.DataFrame(mydict, index=index_d)
# calculate n, m and ko
collect['$n$'] = collect['coef_ld'].apply(lambda x: (1-1/x))
collect['$m$'] = collect['$n$'].apply(lambda x: 1/x)
c = []
for i in range(collect.shape[0]):
    k = collect['coef_ld'].iloc[i]*(math.exp(collect['inte_ld'].iloc[i]/collect['coef_ld'].iloc[i]))
    c.append(k)
collect['$k$'] = c
collect['$ko$'] = [999]*collect.shape[0]
for loca in collect['loc_code']:
    collect.loc[collect['loc_code']==loca, '$ko$'] = collect.loc[collect['loc_code']==loca, '$k$']/df6.loc[df6['loc_code']==loca, 'pop_den']
collect.head()
collect.shape
# stats of key parameters
stat.mean(collect['$n$']), stat.stdev(collect['$n$']),stat.variance(collect['$n$']),
stat.mean(collect['$ko$']), stat.stdev(collect['$ko$']),stat.variance(collect['$ko$']),
stat.mean(collect['$m$']), stat.stdev(collect['$m$']),stat.variance(collect['$m$']),
# let's look at the extreme outliers
n_extreme = collect[(collect['$n$']>(stat.mean(collect['$n$'])+3*stat.stdev(collect['$n$'])))|(collect['$n$']<(stat.mean(collect['$n$'])-3*stat.stdev(collect['$n$'])))].index
ko_extreme = collect[(collect['$ko$']>(stat.mean(collect['$ko$'])+3*stat.stdev(collect['$ko$'])))|(collect['$ko$']<(stat.mean(collect['$ko$'])-3*stat.stdev(collect['$ko$'])))].index
n_extreme, ko_extreme
collect.loc[[16361, 18766, 7949]]
collect.drop([16361, 18766, 7949]).mean(axis=0)
# visualize the results
from matplotlib.gridspec import GridSpec
from scipy import stats
x=collect['$ko$']
y=collect['$n$']
sns.set(font='Arial')
sns.set_style('ticks')
fig = plt.figure(figsize=(5,5))
gs = GridSpec(12,18)
gs.update(wspace=0.05, hspace=0.05)
ax_joint = fig.add_subplot(gs[4:12, 0:14])
ax_marg_x = fig.add_subplot(gs[0:4,0:14])
ax_marg_y = fig.add_subplot(gs[4:12, 14:18])
ax_joint.scatter(x,y, alpha=0.6 )
ax_marg_x.hist(x, bins=60)
kde_x = stats.gaussian_kde(x)
space_x = np.linspace(min(x),max(x),15)
ax_marg_x.plot(space_x, kde_x(space_x), linewidth=1)
ax_marg_y.hist(y,bins=40, orientation="horizontal")
kde_y = stats.gaussian_kde(y)
space_y = np.linspace(min(y),max(y),20)
ax_marg_y.plot( kde_y(space_y), space_y, linewidth=1)
ax_joint.set_xlim((0, 1.5))
ax_joint.set_ylim((0, 0.8))
ax_marg_x.set_xlim((0, 1.5))
ax_marg_y.set_ylim((0, 0.8))
ax_marg_x.axis('off')
ax_marg_y.axis('off')
# Set labels on joint
ax_joint.set_ylabel('$n$', rotation=0, fontsize=16, labelpad=10)
ax_joint.set_xlabel('$k$', fontsize=16)
# linear regression log[I]~t to extract r^2_d for comparison
coef = []
inte = []
r2d = []
location = []
aic = []
bic = []
for loca in sorted(set(df6.loc_code)):
    subset = df6[df6.loc_code == loca]
    mi = subset['log [$I$]'].idxmax()
    y = subset['log [$I$]']
    X = subset['days']
    X = sm.add_constant(X)
    ols = sm.OLS(y, X).fit()
    aic.append(ols.aic)
    bic.append(ols.bic)
    if len(ols.params.values) == 2:
        coef.append(ols.params.values[1])
        inte.append(ols.params.values[0])
        r2d.append(ols.rsquared)
        location.append(loca)
    else:
        print(loca)
stat.mean(aic), stat.mean(bic)
collect['r2_d'] = r2d
collect.head()
# compare the r^2 values of the two models
r_sq = collect[['r2_ld','r2_d']].copy()
r_sq.rename(columns={'r2_ld': 'ln[$I$] ~ ln $t$', 'r2_d': 'ln[$I$] ~ $t$'}, inplace=True)
r_sq_melt = pd.melt(r_sq)
r_sq_melt.rename(columns={'variable': 'Regression', 'value': '$R^2$'}, inplace=True)
#r_sq_melt.to_csv('r_sq_melt {d}.csv'.format(d=date_str))
plt.figure(figsize=(6,6))
ax= sns.histplot(r_sq_melt, x='$R^2$', hue='Regression', bins=50, kde=True, palette="tab10")
# extract demographic and socioeconomic data
demogr = ['loc_code', 'location', 'continent', 'pop_den','median_age', 'life_expectancy', 'older_65','older_70',
'gdp_pc', 'cvd_death_rate','diabetes_prevalence', 'female_smokers', 'male_smokers','hospital_beds_pm']
demo_data = df6[demogr].loc[index_d].copy()
# combine it with disease characteristic parameters
df7 = pd.concat([collect, demo_data], axis=1)
#df7.to_csv('df7 {date}.csv'.format(date=date_str), index=False)
# The data correlation heatmap without the case numbers
Var_Corr = round(df7.corr(), 2)
dropSelf = np.zeros_like(Var_Corr)
dropSelf[np.triu_indices_from(dropSelf)] = True
plt.figure(figsize = (12,8))
g = sns.heatmap(Var_Corr, xticklabels=Var_Corr.columns, yticklabels=Var_Corr.columns,cmap='coolwarm', vmax=0.8,
center=0,annot=True, mask=dropSelf)
g.set_xticklabels(g.get_xticklabels(), rotation=-45, ha='left');
df8 = df7.drop(columns=['life_expectancy', 'older_65', 'older_70'])
Var_Corr = round(df8.corr(), 2)
dropSelf = np.zeros_like(Var_Corr)
dropSelf[np.triu_indices_from(dropSelf)] = True
plt.figure(figsize = (12,8))
g = sns.heatmap(Var_Corr, xticklabels=Var_Corr.columns, yticklabels=Var_Corr.columns,cmap='coolwarm', vmax=0.8,
center=0,annot=True, mask=dropSelf)
g.set_xticklabels(g.get_xticklabels(), rotation=-45, ha='left');
df9 = pd.concat([df8, pd.get_dummies(df8['continent'], prefix='con')],axis=1).dropna()
# this is one-hot encoding, now drop one of the continents to make it a dummy encoding
# and drop the original 'continent' column
df9.drop(['continent', 'con_Europe'],axis=1, inplace=True)
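The two-step encoding above (one-hot, then dropping one level) can also be done in a single call with `drop_first=True`; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'continent': ['Asia', 'Europe', 'Africa']})
# drop_first=True removes the first category alphabetically (Africa here),
# which then serves as the reference level of the dummy coding
dummies = pd.get_dummies(toy['continent'], prefix='con', drop_first=True)
```

Note that `drop_first` always drops the alphabetically first level, whereas the code above deliberately drops `con_Europe` to make Europe the reference level, so the two approaches differ in which category is the baseline.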
X9 = df9[['pop_den', 'median_age', 'gdp_pc',
'cvd_death_rate', 'diabetes_prevalence', 'female_smokers',
'male_smokers', 'hospital_beds_pm', 'tests', 'con_Africa', 'con_Asia',
'con_North_America', 'con_Oceania', 'con_South_America']].copy()
y9 = df9[['$n$','$ko$']].copy()
# Modeling for n
# Screen for the optimal n_estimators value and max_features setting.
ensemble_models = [
("RandomForestRegressor, max_features='sqrt'",
RandomForestRegressor(n_estimators=100,
warm_start=True, oob_score=True,
max_features="sqrt",
random_state=123)),
("RandomForestRegressor, max_features='auto'",
RandomForestRegressor(n_estimators=100,
warm_start=True, oob_score=True,
max_features="auto",
random_state=123))]
# Map model names to a list of (<n_estimators>, <error rate>) pairs.
from collections import OrderedDict
error_rate = OrderedDict((label, []) for label, _ in ensemble_models)
# Set the range of `n_estimators` values to explore.
min_estimators = 20
max_estimators = 300
# Screen for the optimal n_estimators value and max_features setting.
for label, model in ensemble_models:
    for i in range(min_estimators, max_estimators + 1):
        model.set_params(n_estimators=i)
        model.fit(X9, y9['$n$'])
        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - model.oob_score_
        error_rate[label].append((i, oob_error))
# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, model_err in error_rate.items():
    xs, ys = zip(*model_err)
    plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("Number of trees")
plt.ylabel("OOB error rate")
plt.legend(loc="best")
rf_1 = RandomForestRegressor(n_estimators= 20, warm_start=True, max_features='sqrt',
oob_score=True, random_state=123)
rf_1.fit(X9, y9['$n$'])
print("oob_score:", rf_1.oob_score_)
print(" score:", rf_1.score(X9, y9['$n$']))
### Importance analysis
importances = list(rf_1.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(list(X9.columns), importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
importance_df = pd.DataFrame(feature_importances)
importance_df.columns = ['features', 'importance']
#importance_df.to_csv('importance_n.csv')
plt.figure(figsize = (12,5))
g = sns.barplot(x="features", y="importance", data=importance_df[importance_df['importance']>0.0001])
g.set_xticklabels(g.get_xticklabels(), rotation=-45, ha='left')
g.tick_params(labelsize=15)
g.set_xlabel('Features', fontsize=20)
g.set_ylabel('Importance', fontsize=20)
fig=g.get_figure()
fig.subplots_adjust(bottom=0.45)
# Modeling for ko
# Screen for the optimal n_estimators value and max_features setting.
ensemble_models = [
("RandomForestRegressor, max_features='sqrt'",
RandomForestRegressor(n_estimators=100,
warm_start=True, oob_score=True,
max_features="sqrt",
random_state=123)),
("RandomForestRegressor, max_features='auto'",
RandomForestRegressor(n_estimators=100,
warm_start=True, oob_score=True,
max_features="auto",
random_state=123))]
# Map model names to a list of (<n_estimators>, <error rate>) pairs.
from collections import OrderedDict
error_rate = OrderedDict((label, []) for label, _ in ensemble_models)
# Set the range of `n_estimators` values to explore.
min_estimators = 20
max_estimators = 300
for label, model in ensemble_models:
    for i in range(min_estimators, max_estimators + 1):
        model.set_params(n_estimators=i)
        model.fit(X9, np.log(y9['$ko$']))
        # Record the OOB error for each `n_estimators=i` setting.
        oob_error = 1 - model.oob_score_
        error_rate[label].append((i, oob_error))
# Generate the "OOB error rate" vs. "n_estimators" plot.
for label, model_err in error_rate.items():
    xs, ys = zip(*model_err)
    plt.plot(xs, ys, label=label)
plt.xlim(min_estimators, max_estimators)
plt.xlabel("Number of trees")
plt.ylabel("OOB error rate")
plt.legend(loc="best")
rf_1 = RandomForestRegressor(n_estimators= 75, warm_start=True, max_features='auto',
oob_score=True, random_state=123)
rf_1.fit(X9, np.log(y9['$ko$']))
print("oob_score:", rf_1.oob_score_)
print(" score:", rf_1.score(X9, np.log(y9['$ko$'])))
### Importance analysis
importances = list(rf_1.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(list(X9.columns), importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
importance_df = pd.DataFrame(feature_importances)
importance_df.columns = ['features', 'importance']
#importance_df.to_csv('importance_ko.csv')
plt.figure(figsize = (12,5))
g = sns.barplot(x="features", y="importance", data=importance_df[importance_df['importance']>0.000001])
g.set_xticklabels(g.get_xticklabels(), rotation=-45, ha='left')
g.tick_params(labelsize=15)
g.set_xlabel('Features', fontsize=20)
g.set_ylabel('Importance', fontsize=20)
fig=g.get_figure()
fig.subplots_adjust(bottom=0.45)
imp_ko = importance_df.sort_values(by=['importance'], ascending=False)